Abstract


A recurring problem when building probabilistic latent variable models is regularization and
model selection, for instance the choice of the dimensionality of the latent space. In the
context of belief networks with latent variables, this problem can be addressed with Automatic
Relevance Determination (ARD) employing Monte Carlo inference. We present a variational
inference approach to ARD for Deep Generative Models using doubly stochastic variational
inference to provide fast and scalable learning. We show empirical results on a standard dataset
illustrating the effects of contracting the latent space automatically. We show that the
resulting latent representations are significantly more compact without loss of expressive power
of the learned models.


1 Introduction 

In recent years, probabilistic treatments of deep learning architectures have spurred a renewed
interest in the marriage of Graphical Models with the expressiveness of neural networks. While
Bayesian inference has traditionally been hard for nonlinear graphical models and based on
sampling, recent efforts have focused more on deterministic approximate inference. Variational
inference methods for stochastic belief networks such as stochastic feedforward networks [?], the
variational autoencoder [?] and related techniques [?] have been proposed and applied
successfully in various contexts.

Since these models are highly competitive as density estimators and bring the promise of being used 
as parts of larger graphical models, a fruitful area of research is the understanding and incorporation 
of structure on the latent spaces learned by such models. A key problem that remains unaddressed
in the context of these models is a principled criterion to determine the number of latent variables 
needed to model the data, especially in high-dimensional regimes. 

While a trivial approach to selecting the dimensionality of the latent space is a line search
over different settings with the marginal likelihood as a decision criterion, it is commonly of
interest in probabilistic models to specify a large number for the available dimensionality and
let the data evidence prune the useless dimensions in the posterior.

In this work we will study the use of doubly stochastic variational inference and the addition of
a prior distribution over relevance-weights of latent dimensions to infer the effective
dimensionality of the latent space in deep generative models and present experimental evidence of
the effects on the inferred latent spaces.

2 Background 

2.1 Related Work 

An initial Bayesian treatment for regularization of Neural Networks was performed in seminal work
by [?] and [?]. They introduced a notion called Automatic Relevance Determination (ARD), which
places a prior distribution on the generative weights attached to the latent variables that
encourages the weights to be zero. Effectively, by integrating over such priors using Monte
Carlo, settings for the variances of the prior can be inferred from data, leading to pruning of
unnecessary latent dimensions. ARD was also notably used in a variational treatment of the
relevance vector machine [?] to infer a mask over the data features needed for a predictive
model. The idea of relevance determination for belief networks in combination with variational
inference has also been explored before in [?], but is different from our model.

Additionally, ARD is a key component of many Gaussian Process models [?], where it is used to
select features of the input data to be passed through a covariance function. A similarly
inspired model to ours uses Bayesian Gaussian Process Latent Variable Models with ARD [?] to
learn rich latent spaces on multiple views, with the main difference being the use of Gaussian
Processes instead of nonlinear parametric latent variable models. Finally, a related inference
method was presented in [?], where doubly stochastic variational inference was used for relevance
determination on the input weights in logistic regression, but not in the context of deep
generative models.

Apart from ARD, there is other recent work regularizing and manipulating deep generative models
to add structure to the latent space. In [?] a penalty term is introduced to force latent
variables to decorrelate, while in [13] known factors of variation are manipulated in a
controlled manual fashion during gradient descent to encourage learning of disentangled latent
variables. Such approaches bring benefits in learning more interpretably shaped latent spaces and
result in impressive performance gains.

2.2 Stochastic Variational Bayes For Deep Generative Models 

We are presented with N D-dimensional datapoints $x^{(i)} \in \mathbb{R}^D$. We are interested in
learning a probability distribution which captures the data well and thus maximizes $p(x)$, for
instance an embedding. Let's assume a directed graphical model with latent variables $z^{(i)}$
corresponding to observed variables $x^{(i)}$. The latent variables can be drawn from any
exponential family distribution $p(z)$, but simplifying cases for inference and learning exist
for many continuous distributions.

Given a differentiable function $f$ parametrized by weights $\theta$, such as a multi-layer
perceptron (MLP), we can write the model as follows:

$$p(x; \theta) = \int_z p_\theta(x|z)\, p(z)\, dz \tag{1}$$

with

$$p_\theta(x|z) = f(x; z, \theta). \tag{2}$$

A good estimator for learning the parameters of such a model by assuming an approximate
conditional posterior $q_\phi(z|x)$ was suggested in [14] [?], all of which can generally be
understood as instances of doubly stochastic variational inference. The estimator forms a
variational lower bound [?] to the marginal likelihood of the data. Thus, performing coordinate
ascent with respect to variational parameters $\phi$ and $\theta$ corresponds to minimizing the
divergence between the true posterior and the approximate one:

$$\log p_\theta(x^{(i)}) \geq \mathcal{L}(\theta, \phi; x^{(i)}) = -KL(q_\phi(z) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z)}[\log p_\theta(x^{(i)}|z)]. \tag{3}$$
We will denote this model with SGVB from now on. SGVB is quite robust to overfitting, as it
optimizes a variational lower bound acting as a strong regularizer.
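
To make the structure of the bound in Equation 3 concrete, the following is a minimal NumPy sketch of a Monte Carlo estimate with reparametrized samples. It assumes a diagonal Gaussian $q_\phi(z)$, a standard normal prior and a Gaussian likelihood with fixed unit variance; the linear `decoder` and all function names are hypothetical stand-ins for the MLP $f$, not the implementation used in the experiments.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def gaussian_log_lik(x, mean, var=1.0):
    """log N(x; mean, var * I) for a fixed scalar variance (an assumption)."""
    return -0.5 * (x.size * np.log(2 * np.pi * var) + np.sum((x - mean) ** 2) / var)

def sgvb_bound(x, mu, log_var, decoder, rng, n_samples=1):
    """Monte Carlo estimate of the lower bound using reparametrized samples of z."""
    std = np.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + std * eps              # z is a deterministic function of (mu, std, eps)
        recon += gaussian_log_lik(x, decoder(z))
    return -kl_to_standard_normal(mu, log_var) + recon / n_samples

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))         # toy linear decoder weights (hypothetical)
decoder = lambda z: W @ z
x = rng.standard_normal(5)
bound = sgvb_bound(x, mu=np.zeros(3), log_var=np.zeros(3), decoder=decoder, rng=rng)
```

Because the sample `z` is written as a deterministic transformation of the noise `eps`, an automatic differentiation framework could propagate gradients through it to both the variational and the generative parameters.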

3 Relevance Determination For Deep Generative Models 

We already pointed out the robustness of variational inference to overfitting. However, it is
unclear whether this robustness extends to high-dimensional spaces. Regularization of complexity
in deep unsupervised generative models can be performed in various ways: regularizing the
previously deterministic weights, for instance by incorporating uncertainty using prior
distributions on weights as suggested in the supplementary material of [6], or inclusion of
another stochastic variable which explicitly only controls the complexity of the prior in order
to perform model selection. Our contribution is concerned with the latter case, which is a
simpler setting as it involves integrating over a relatively low-dimensional latent variable
rather than all the weights in a complex network. In order to address the problem of model
selection for the dimensionality of the latent space, we re-introduce relevance determination as
a sparsifying factor in the latent space.

We can add relevance weights to the generative model by multiplying each latent variable
dimension $z_d$ with a corresponding relevance weight $w_d$. We model $w$ using a Gaussian prior
distribution $p(w) = \mathcal{N}(w; 0, \Lambda)$, where $\Lambda = \mathrm{diag}(\lambda_1,
\lambda_2, \ldots)$ denotes the diagonal covariance matrix.

The model then changes by incorporating two parents to variable $x$, namely variables $z$ and $w$:

$$p(x; \theta) = \int_w \int_z p_\theta(x|z, w)\, p(w)\, p(z)\, dz\, dw \tag{4}$$

with

$$p(x|z, w) = f(x; z \odot w, \theta). \tag{5}$$

In Figure 1 we sketch how this corresponds to a graphical model with a more complicated factorial
prior distribution than the one considered in [?].



Figure 1: a. Graphical Model of SGVB. b. Graphical Model showing (G)SGVB-ARD 

The particular structure of the relevance weights shows that they form a mask over the latent
variable $z$, which is an input to the generative model $p_\theta$. The goal of adding this
structure to the model is to allow the relevance weights to automatically reach distributions
peaked around zero, in order to effectively prune dimensions of the input, which is a latent
variable. Apart from regularization and model selection, the added variable also provides
interpretability once the model has been successfully inferred, by allowing the user to inspect
the relevant latent variables directly.
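
The masking effect can be illustrated with a small NumPy sketch. It assumes a one-hidden-layer rectified linear decoder with random weights; the function `decode`, the layer sizes and the example mask are hypothetical and only show that a relevance weight of exactly zero removes the corresponding latent dimension from the generative path.

```python
import numpy as np

def decode(z, w, W1, b1, W2, b2):
    """Mean of p(x | z, w): an MLP applied to the masked latent z * w."""
    h = np.maximum(0.0, W1 @ (z * w) + b1)   # rectified linear hidden layer
    return W2 @ h + b2

rng = np.random.default_rng(0)
n_latent, n_hidden, n_data = 4, 8, 6        # toy sizes, chosen for illustration
W1 = rng.standard_normal((n_hidden, n_latent))
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_data, n_hidden))
b2 = np.zeros(n_data)

w = np.array([1.0, 0.0, 0.7, 0.0])          # dimensions 1 and 3 are switched off
z_a = rng.standard_normal(n_latent)
z_b = z_a.copy()
z_b[[1, 3]] += 10.0                         # perturb only the pruned dimensions

same_output = np.allclose(decode(z_a, w, W1, b1, W2, b2),
                          decode(z_b, w, W1, b1, W2, b2))
```

Here `same_output` is true: large changes in the masked dimensions of `z` leave the decoder output unchanged, which is the sense in which those dimensions are pruned.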

Learning requires inferring optimal values for hyperparameters $\Lambda$ from the data. If
$\Lambda$ becomes arbitrarily close to zero in any dimension, this dimension becomes meaningless
to the generative model, as the prior has a fixed mean value of 0. We introduce an approximate
posterior distribution over the weights $q(w; \tau)$ with $\tau_d = \{\mu_{\tau,d},
\sigma_{\tau,d}\}$ for each dimension $d$. The goal now becomes to infer $q_\tau$ and use it to
perform likelihood maximization on the prior parameters $\Lambda$ using the data evidence. Such a
procedure is called evidence maximization or empirical Bayes. An alternative would be to set a
prior distribution over $\Lambda$, like a Gamma distribution, and infer the posterior values of
$\Lambda$, at additional cost and complexity. In our case empirical Bayes corresponds to
formulating and maximizing a lower bound to the marginal likelihood of the data with respect to
the involved variational parameters $\theta$, $\phi$, $\tau$ and the hyperparameters $\Lambda$.
The bound can be derived using rigorous variational inference principles [?] and becomes as
follows:


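$$\mathcal{L}(\theta, \phi, \tau, \Lambda; x^{(i)}) = -KL(q_\phi(z) \,\|\, p_\theta(z)) - KL(q_\tau(w) \,\|\, p_\theta(w)) + \mathbb{E}_{q_\phi(z)}\big[\mathbb{E}_{q_\tau(w)}[\log p_\theta(x^{(i)}|z, w)]\big]. \tag{6}$$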
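
In this bound the hyperparameters of the relevance prior enter only through the KL term between the approximate posterior over $w$ and the zero-mean Gaussian prior. As a hedged illustration of the empirical Bayes step, the sketch below evaluates that KL term for diagonal Gaussians and the prior variance that minimizes it for a fixed posterior, which follows from setting the derivative with respect to each prior variance to zero. The function names are hypothetical, and the closed-form update is shown for intuition only; the text describes optimizing the hyperparameters by differentiating the bound.

```python
import numpy as np

def kl_ard(mu_tau, sigma_tau, lam):
    """KL(N(mu_tau, diag(sigma_tau^2)) || N(0, diag(lam))), summed over dimensions."""
    var = sigma_tau ** 2
    return 0.5 * np.sum(np.log(lam / var) + (var + mu_tau ** 2) / lam - 1.0)

def best_lambda(mu_tau, sigma_tau):
    """Per-dimension prior variance that minimizes kl_ard for a fixed q(w)."""
    return mu_tau ** 2 + sigma_tau ** 2

mu_tau = np.array([1.5, 0.01, -0.8])     # hypothetical posterior means of w
sigma_tau = np.array([0.3, 0.02, 0.5])   # hypothetical posterior standard deviations
lam_star = best_lambda(mu_tau, sigma_tau)
```

When the posterior of a relevance weight concentrates around zero (small mean and small standard deviation, as in the second entry), the minimizing prior variance also shrinks towards zero, which corresponds to the pruning behaviour described above.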


As we can see, it involves nested expectations over approximate distributions $q_\tau$ and
$q_\phi$, which are intractable to calculate analytically. We can apply the reparametrization
trick from [?] and use doubly stochastic variational inference to approximate the double
expectation using just small numbers of samples $w^{(l)}$, $z^{(k)}$ from the approximate
distributions $q_\tau(w)$, $q_\phi(z)$ respectively.

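The bound then becomes:

$$\mathcal{L}(\theta, \phi, \tau, \Lambda; x^{(i)}) \simeq -KL(q_\phi(z) \,\|\, p_\theta(z)) - KL(q_\tau(w) \,\|\, p_\theta(w)) + \frac{1}{N_w N_z} \sum_{l=1}^{N_w} \sum_{k=1}^{N_z} \log p_\theta(x^{(i)}|z^{(k)}, w^{(l)}) \tag{7}$$

and can be differentiated with respect to $\theta$, $\phi$, $\tau$ and $\Lambda$. We will denote
this algorithm with SGVB-ARD henceforth.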
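
A minimal NumPy sketch of such a doubly stochastic estimate is given below. It assumes diagonal Gaussian approximate posteriors, a standard normal prior on $z$, a zero-mean prior on $w$ with variances `lam`, and a Gaussian likelihood with fixed variance `noise_var`; these choices, the function names and the toy linear decoder are assumptions made for illustration.

```python
import numpy as np

def diag_gauss_kl(mu_q, var_q, mu_p, var_p):
    """KL between two diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def sgvb_ard_bound(x, q_z, q_w, lam, decoder, rng, n_z=1, n_w=1, noise_var=1.0):
    """Estimate of the bound with nested reparametrized samples of w and z.

    q_z and q_w are (mean, std) pairs; lam holds the prior variances of w.
    """
    (mu_z, std_z), (mu_w, std_w) = q_z, q_w
    kl_z = diag_gauss_kl(mu_z, std_z ** 2, np.zeros_like(mu_z), np.ones_like(mu_z))
    kl_w = diag_gauss_kl(mu_w, std_w ** 2, np.zeros_like(mu_w), lam)
    log_lik = 0.0
    for _ in range(n_w):
        w = mu_w + std_w * rng.standard_normal(mu_w.shape)      # sample relevance weights
        for _ in range(n_z):
            z = mu_z + std_z * rng.standard_normal(mu_z.shape)  # sample latent variables
            resid = x - decoder(z * w)
            log_lik += -0.5 * (x.size * np.log(2 * np.pi * noise_var) + resid @ resid / noise_var)
    return -kl_z - kl_w + log_lik / (n_w * n_z)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))          # toy linear decoder (hypothetical)
x = rng.standard_normal(5)
q_z = (np.zeros(3), np.ones(3))
q_w = (np.ones(3), 0.1 * np.ones(3))
bound = sgvb_ard_bound(x, q_z, q_w, np.ones(3), lambda m: W @ m, rng, n_z=2, n_w=2)
```

With `n_z=1` and `n_w=1` this reduces to the single-sample setting, at the price of a noisier estimate.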


Special Case: Gaussian Prior In the common case of a Gaussian prior on the latent space, we
can exploit the model structure to simplify the variational objective. We notice the specific
structure in the model $p(x) = \int_{z,w} p_\theta(x^{(i)}|z, w)\, p(z)\, p(w)\, dz\, dw$, which
involves a product of two normally distributed variables. We can replace the product with a joint
Gaussian distribution over variable $m = z \odot w$ with $p(m) = p(z)p(w)$ analytically and reach
a closed form normal distribution for the variational bound. Analytically, for the posterior
distributions this becomes:


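$$\sigma_{\tau\phi}^2 = (\sigma_\phi^{-2} + \sigma_\tau^{-2})^{-1} \tag{8}$$

$$\mu_{\tau\phi} = \mu_\phi + \mu_\tau \tag{9}$$

$$q(m) = \mathcal{N}(m; \mu_{\tau\phi}, \sigma_{\tau\phi}^2) \tag{10}$$

and similarly for the priors:

$$\sigma_m^2 = (\sigma_w^{-2} + \sigma_z^{-2})^{-1} \tag{11}$$

$$\mu_m = \mu_z + \mu_w \tag{12}$$

$$p(m) = \mathcal{N}(m; \mu_m, \sigma_m^2) \tag{13}$$

for each dimension.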

This reformulation can be helpful when writing the variational bound and applying the 
reparametrization trick since it reduces noise by just having a single expectation over which to 
evaluate the likelihood and a single regularizer in the latent space. We will denote this by GSGVB- 
ARD. 
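
As a small numerical illustration of the variance combination, the sketch below assumes that the combined per-dimension variance of $m$ takes the precision-additive form $(\sigma_z^{-2} + \sigma_w^{-2})^{-1}$. The function name and the example values are hypothetical.

```python
import numpy as np

def combine_variances(var_z, var_w):
    """Per-dimension combined variance (var_z^-1 + var_w^-1)^-1."""
    return 1.0 / (1.0 / var_z + 1.0 / var_w)

var_z = np.ones(4)                          # unit-variance prior on z
var_w = np.array([1.0, 0.1, 1e-4, 10.0])    # hypothetical relevance variances
var_m = combine_variances(var_z, var_w)
```

Under this form the combined variance is never larger than the smaller of its two inputs, so a relevance variance close to zero shrinks the corresponding dimension of $m$ regardless of the variance of $z$.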


4 Results 

In [?] it was noted that the performance of resulting models is roughly independent of
dimensionality and thus robust to picking a high dimensional latent space. In addition, in the
more informal [?] data is shown indicating that the squared weights as an indicator of importance
per dimension of the topmost generative layer are automatically pruned when using SGVB and appear
constant when changing the size of the latent space. We perform a set of experiments to evaluate
how robust SGVB is empirically to changes in size of latent space with respect to the implicit
dimensionality of the latent space. We also compare to analogous experiments using SGVB-ARD and
GSGVB-ARD. We find that SGVB indeed prunes dimensions of $z$ automatically by pushing weights to
zero over time during training and appears to use similar amounts of latent variables when
changing dimensionality. However, using ARD results in (sometimes significantly) more compact
models in either case at no cost to training-likelihood and reconstruction quality while
typically boosting test-set likelihoods. In all experiments we use minibatches of size 200 and
found usage of RMSProp [?] with momentum as an optimizer crucial for performance. We furthermore
report the learning rates per experiment. We initialize the $\Lambda$ parameters for the ARD
weights to 1, such that ARD weights are free to fluctuate strongly a priori in all settings. In
all cases we draw just one sample per datapoint in each iteration from the approximate
variational distributions to perform coordinate ascent. We expect results to improve with using
more samples. We show detailed results in Table 1 and the details in the following sections.
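
For readers unfamiliar with the optimizer, the following sketch shows one common variant of RMSProp with momentum, written as an ascent step since the lower bound is maximized. The exact variant and the `decay` and `momentum` constants used in the experiments are not stated in the text, so they are assumptions here, as is the toy quadratic objective.

```python
import numpy as np

def rmsprop_momentum_step(param, grad, state, lr=1e-4, decay=0.9, momentum=0.9, eps=1e-8):
    """One ascent step with RMSProp-scaled gradients and a momentum (velocity) term."""
    state["sq"] = decay * state["sq"] + (1.0 - decay) * grad ** 2
    state["vel"] = momentum * state["vel"] + lr * grad / (np.sqrt(state["sq"]) + eps)
    return param + state["vel"]            # plus sign: the bound is maximized

# Toy objective -(p - 3)^2 with gradient -2 (p - 3); its maximum is at p = 3.
p = np.array([0.0])
state = {"sq": np.zeros(1), "vel": np.zeros(1)}
for _ in range(2000):
    p = rmsprop_momentum_step(p, -2.0 * (p - 3.0), state, lr=1e-2)
```

The running average of squared gradients rescales each coordinate separately, which is one plausible reason such an optimizer helps when gradient estimates are as noisy as in the single-sample setting described above.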


Table 1: Comparison of log p(x) of various models

Data set                         SGVB               SGVB-ARD    GSGVB-ARD
Frey Faces Train 400dh 50sh      1162.36            1179.63     1160.67
Frey Faces Train 400dh 100sh     975.06 (1148.67)   1209.54     1146.91
Frey Faces Test 400dh 50sh       619.14             643.55      583.87
Frey Faces Test 400dh 100sh      547.17 (627.35)    667.42      627.10
Frey Faces Train 200dh 50sh      1104.88            1181.84     1126.82
Frey Faces Train 200dh 100sh     1155.09            1200.31     1145.12
Frey Faces Test 200dh 50sh       637.47             667.41      654.16
Frey Faces Test 200dh 100sh      653.06             686.95      673.30


Table 2: Comparison of number of retained latent variables

Data set                     SGVB       SGVB-ARD    GSGVB-ARD
Frey Faces 400dh 50sh        27 (30)    9           19 (9)
Frey Faces 400dh 100sh       26         9           21 (9)
Frey Faces 200dh 50sh        21         9           15 (9)
Frey Faces 200dh 100sh       16         8           11 (9)


4.1 Frey Faces 

The Frey Faces dataset is a real-valued dataset showing faces in similar conditions. As it is not
very large (under 2000 datapoints), it is prone to lead to overfitting when using
high-dimensional models in general. We separate the data into 1600 training samples and 365 test
samples. We use 200 and 400 hidden rectified linear units, a Gaussian likelihood and a latent
variable with a zero mean and unit variance prior with 50 and 100 latent dimensions for a total
of 4 experiments. We use a learning rate of 0.0001 in these experiments.

SGVB We ran SGVB for the aforementioned 4 experiments for 10000 iterations over the whole
training set. In this model, we observe that SGVB takes advantage of a large number of available
latent dimensions, but eventually keeps between 16 and 27 dimensions as input, see Table 2 and
Figure 2. Due to strong sampling noise when evaluating the likelihood we average over the
test-scores of the last 100 iterations to reach the reported test-likelihoods of 637.47 nats for
the 50-dimensional model and 653.06 nats for the 100-dimensional model with 200 deterministic
hidden units. The corresponding training scores are 1104.88 nats and 1155.09 nats, showing
superiority of the model with more parameters on training data.

When using 400 hidden deterministic units, learning becomes unstable due to the paucity of data. 
We report best achieved scores in brackets in addition to averages. We observe similar trends as 
before, but the model uses 27/26 latent dimensions now instead of the 21 or 16 used with the better 
inferred model, indicating effects of overfitting. 
Figure 2: SGVB tile showing results for SGVB without ARD. To the right, an original image and
the mean reconstruction of a model sample with variance over it. The middle shows a traceplot
of the learning curve. The leftmost figure shows the squared sum of weights linked to each latent
dimension, showing that SGVB prunes weights when needed.
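
One way to turn such a squared-weight plot into a count of retained latent dimensions is to threshold the squared weight mass per latent dimension. The sketch below is a hypothetical procedure: the text does not state how the counts were obtained, and the relative cutoff `rel_tol` and the example matrix are assumptions.

```python
import numpy as np

def retained_dimensions(W_in, rel_tol=1e-2):
    """Indices of latent dimensions whose outgoing squared weight mass is non-negligible.

    W_in has shape (n_hidden, n_latent); column d holds the weights leaving latent d.
    rel_tol is a hypothetical cutoff relative to the largest column.
    """
    mass = np.sum(W_in ** 2, axis=0)
    return np.flatnonzero(mass > rel_tol * mass.max())

W_in = np.array([[0.9, 1e-4, -0.5],
                 [0.3, -2e-4, 0.8]])     # toy weights: the middle latent is pruned
kept = retained_dimensions(W_in)
```

In this toy example only the first and third latent dimensions are kept, since the weights leaving the second one are several orders of magnitude smaller.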






Figure 3: (G)SGVB-ARD tile (a) shows converged plots for SGVB-ARD with 100 hidden units. On 
the left, a reconstruction is shown at the end of training. The middle shows the learning traceplot. 
On the left, the relevance weight prior parameters and posterior parameters are shown. (b) shows
the same plots for GSGVB-ARD.


SGVB-ARD We use the same settings for SGVB-ARD, to provide interpretable outcomes. During
training we observe a very different behaviour of the lower bound of the training set. SGVB-ARD
makes much slower progress than SGVB in the initial training phase. Monitoring reconstructions
and relevance weights helps us to understand this behaviour better. Until about iteration 900, in
both models with 50 and 100 latent variables, the variational bound makes slow progress while
improving the reconstruction quality of the images more slowly than with SGVB. However, the
uncertainty over the relevance weights is reduced systematically in these first 900 iterations.
At about iteration 900, a phase switch occurs during which certain dimensions are pruned by
setting relevance weights to small values and the model switches into a noisier training mode of
optimizing reconstruction quality and exhibits behaviour closer to SGVB with fewer latent
dimensions. After this initial phase SGVB-ARD makes faster progress during training. This can
largely be attributed to usage of empirical Bayes, as the model initially deals with uncertainty
in the relevance variables and initializes generative weights to a favourable regime. However, we
observe that this uncertainty is gradually reduced over 1000 iterations instead of reaching a
local minimum very quickly. For all the described effects also see Figure 3.

We observe in Table 2 that SGVB-ARD retains 9 (once 8) latent variables stably across all
settings, which is a significantly lower number than with the other approaches. While it can be
argued that this may be a local minimum due to empirical Bayes, performance indicates that these
models perform very well as they reach higher likelihoods in training and testing across all
settings (see Table 1) and show remarkable robustness to the data-poor setting when using 400
deterministic hidden units.

GSGVB-ARD GSGVB-ARD has the theoretical advantage of evaluating a single expectation rather
than dealing with the stochastic approximations to two nested expectations. This results in
learning behaviour that resembles SGVB much more than SGVB-ARD, as no explicit phase-shift occurs
during training and gradient noise is fairly stable. Inspecting the ARD weights also shows that
they are not converged after the approximately 1000 iterations SGVB-ARD needs, prompting us to
believe that learning is more evenly distributed over the learning process and thus more robust.

In terms of performance, we note that GSGVB-ARD also successfully prunes latent dimensions,
produces qualitatively pleasing reconstructions and achieves high likelihoods. However, we note
that it differs in performance from SGVB-ARD in two notable ways. First, GSGVB-ARD prunes fewer
dimensions than SGVB-ARD but still more than SGVB. Interestingly it does this adaptively, as it
retains more dimensions in the 400 hidden units setting than in the 200 hidden units setting.
Close inspection of the relevance weights shows that many of the retained ones are close to 0 and
the ones with significant posterior mass appear to be 9, which would indicate that similar models
are retrieved as with SGVB-ARD but that training is not fully converged yet. Second, in terms of
likelihood, GSGVB-ARD performs on par with SGVB, but is outperformed by SGVB-ARD. However,
learning was observed to be more stable to extremely noisy gradient steps than in the case of
SGVB in the overfitting regime (400 units), which may be explained by the regularizing influence
of the mask on the gradients. We also note that after 10000 iterations GSGVB-ARD was still
improving, which may indicate that it may benefit from longer training times.

In summary, we observe that ARD benefits performance and compactifies learned models across
the board. While SGVB-ARD exhibits two distinct phases in learning which may lead to fears of
overfitting in more general settings, this issue is largely addressed in GSGVB-ARD due to
analytical integration which smooths out the progress in learning across the two prior variables
and leads to similar results.


5 Discussion 

We have presented a method to perform fast, scalable feature selection in latent space using
automatic relevance determination in nonlinear latent variable models trained with doubly
stochastic variational inference. We observe that the application of ARD has strong contractive
effects on the learned models in terms of variational compression and thus acts as a useful
regularizer. In the context of unsupervised learning with disentangled and irreducible
representations in mind, compactness is a desirable property for generative models with the added
benefit of speeding up computation. The reduction of the latent space propagates throughout the
learned weights of the generative model as evidenced by the strong shrinkage of the top-most
weights. This can have a profound effect on the learned representation and its expressiveness. We
observe that variations in dimensionality frequently result in parsimonious final models using
ARD, which suggests coherence in the learned models. However, further experimental evidence of
the empirical behaviour of ARD in high-dimensional and data-poor settings is needed to elucidate
its benefits. Apart from the empirical success of the demonstrated method, access to a mask
indicating relevance of particular dimensions also supports interpretability of learned models.
This is a direct effect of using a factorial prior, which we believe will become more common in
various contexts in deep generative models. It will be of interest for future work to study the
effects of ARD on deeper and more structured generative models, especially if also used in
intermediate layers, in terms of their semantic and subjective generative qualities and
identifiability of the models in detail. We finally note that a useful but potentially
computationally costly alternative to ARD would be a fully Bayesian treatment of all weights in
the network, as the network could learn to prune weights in the posterior when supported by data.
The fact that just by using variational inference and without explicit shrinkage SGVB can learn
to prune latent dimensions even without ARD further supports this view.